Blind Data Linkage Using n-gram Similarity Comparisons
Abstract
Data linkage is the task of linking together information from one or more data sources that represents the same entity (patient, customer, business, gene sequence, etc.). If no unique identifier is available, probabilistic linkage techniques have to be applied. Real-world data is often dirty: it contains missing values, typographical and other errors, different coding schemes and formats, and out-of-date data. Names and addresses are especially prone to data entry errors.
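As a rough illustration of how an n-gram comparison tolerates the data entry errors mentioned above, the sketch below scores two strings by the overlap of their character bigrams. The choice of n = 2, the Dice coefficient, and the function names `ngrams` and `dice_similarity` are assumptions made for this example; the abstract does not specify them, and the sketch does not cover the privacy-preserving ("blind") part of the method.

```python
def ngrams(value: str, n: int = 2) -> set:
    """Return the set of character n-grams of a lowercased string."""
    text = value.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram sets: 2 * |A & B| / (|A| + |B|)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Both strings are shorter than n: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# "peter" has bigrams {pe, et, te, er}; "pete" has {pe, et, te}.
# They share 3 bigrams, so the score is 2 * 3 / (4 + 3) = 6/7.
print(dice_similarity("peter", "pete"))
```

A score near 1 suggests the two values probably refer to the same name despite a typographical difference; the threshold at which a pair is treated as a match would have to be chosen for the data at hand.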
Similar resources
Some methods for blindfolded record linkage
BACKGROUND: The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust ...
DKPro Similarity: An Open Source Framework for Text Similarity
We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and p...
Genomic Linkage Analysis of Iranian Clinical Isolates of Dermatophytes Fungi Using the RAPD-PCR
Dermatophytes are a group of keratinophilic fungi capable of invading keratinized tissues (skin, hair and nails). They cause dermatophytosis (commonly known as tinea or ringworm) in humans and animals. In this report, DNA similarities and genomic linkage of 40 dermatophyte strains, obtained from different universities, were studied by random amplified polymorphic DNA (RAPD–PCR) using 11 ra...
RTE4: Normalized Dependency Tree Alignment Using Unsupervised N-gram Word Similarity Score
We propose an unsupervised similarity metric to measure the relevance of word pairs using the Web1T data. The alignment scores between the dependency trees of the text and the hypothesis sentences are calculated based on this new similarity metric and these scores are used to predict the entailment between the text and the hypothesis sentences. The new similarity metric together with other feat...
String Metrics and Word Similarity applied to Information Retrieval
Over the past three decades, Information Retrieval (IR) has been studied extensively. The purpose of information retrieval is to assist users in locating information they are looking for. Information retrieval is currently being applied in a variety of application domains from database systems to web information search engines. The main idea of it is to locate documents that contain terms the u...